Learning to Relate from Captions and Bounding Boxes
In this work, we propose a novel approach that predicts the relationships
between various entities in an image in a weakly supervised manner by relying
on image captions and object bounding box annotations as the sole source of
supervision. Our proposed approach uses a top-down attention mechanism to align
entities in captions with objects in the image, and then leverages the
syntactic structure of the captions to align the relations. We use these
alignments to
train a relation classification network, thereby obtaining both grounded
captions and dense relationships. We demonstrate the effectiveness of our model
on the Visual Genome dataset by achieving a recall@50 of 15% and recall@100 of
25% on the relationships present in the image. We also show that the model
successfully predicts relations that are not present in the corresponding
captions.
Comment: ACL 201
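The recall@K metric reported above can be made concrete with a small sketch. This is an illustrative implementation of the standard metric, not code from the paper; the triples and function name are invented for the example.

```python
# Hedged sketch: how recall@K is typically computed for relationship
# prediction. All names and data below are illustrative.

def recall_at_k(predicted, ground_truth, k):
    """Fraction of ground-truth relations found in the top-k predictions.

    predicted:    list of (subject, predicate, object) triples,
                  sorted by model confidence (highest first).
    ground_truth: set of (subject, predicate, object) triples
                  annotated for the image.
    """
    top_k = set(predicted[:k])
    hits = sum(1 for rel in ground_truth if rel in top_k)
    return hits / len(ground_truth)

preds = [("man", "riding", "horse"), ("horse", "on", "grass"),
         ("man", "wearing", "hat")]
gt = {("man", "riding", "horse"), ("man", "wearing", "hat")}
print(recall_at_k(preds, gt, 2))  # 0.5: one of two relations is in the top-2
```

Recall@50 of 15% thus means that, on average, 15% of an image's annotated relationships appear among the model's 50 most confident predictions.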
5IDER: Unified Query Rewriting for Steering, Intent Carryover, Disfluencies, Entity Carryover and Repair
Providing voice assistants the ability to navigate multi-turn conversations
is a challenging problem. Handling multi-turn interactions requires the system
to understand various conversational use-cases, such as steering, intent
carryover, disfluencies, entity carryover, and repair. The complexity of this
problem is compounded by the fact that these use-cases mix with each other,
often appearing simultaneously in natural language. This work proposes a
non-autoregressive query rewriting architecture that can handle not only the
five aforementioned tasks, but also complex compositions of these use-cases. We
show that our proposed model has competitive single task performance compared
to the baseline approach, and even outperforms a fine-tuned T5 model on
use-case compositions, despite having 15 times fewer parameters and 25 times
lower latency.
Comment: Interspeech 202
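Two of the five use-cases named above, entity carryover and repair, can be illustrated with a toy rewriter. This is not the paper's non-autoregressive model; it is a rule-based sketch, with all rules and example queries invented here, meant only to show what the rewriting task takes as input and produces as output.

```python
# Hedged illustration of the query-rewriting task itself (toy rules,
# not the paper's architecture): resolve entity carryover and repair
# by editing the current query with context from the previous turn.

def rewrite(prev_query, prev_entity, query):
    """Toy query rewriter covering two of the five use-cases."""
    # Entity carryover: replace a pronoun with the entity from context.
    for pronoun in (" it", " that", " them"):
        query = query.replace(pronoun, f" {prev_entity}")
    # Repair: "no, I meant X" swaps the entity in the previous request.
    prefix = "no, i meant "
    if query.lower().startswith(prefix):
        replacement = query[len(prefix):]
        query = prev_query.replace(prev_entity, replacement)
    return query

print(rewrite("play Hello by Adele", "Hello", "add it to my playlist"))
# add Hello to my playlist
print(rewrite("play Hello by Adele", "Hello", "no, I meant Yesterday"))
# play Yesterday by Adele
```

A real system must additionally handle steering, intent carryover, and disfluencies, and compositions of all five, which is why a learned model is used instead of rules.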
DEXTER: Deep Encoding of External Knowledge for Named Entity Recognition in Virtual Assistants
Named entity recognition (NER) is usually developed and tested on text from
well-written sources. However, in intelligent voice assistants, where NER is an
important component, input to NER may be noisy because of user or speech
recognition errors. In applications, entity labels may change frequently, and
non-textual properties like topicality or popularity may be needed to choose
among alternatives.
We describe an NER system intended to address these problems. We train and
test this system on a proprietary user-derived dataset. We compare it with a
baseline text-only NER system; the baseline enhanced with external gazetteers;
and the baseline enhanced with the search and indirect labelling techniques we
describe below. The final configuration yields around a 6% reduction in NER
error rate. We also show that this technique improves related tasks, such as
semantic parsing, where it reduces error rate by up to 5%.
Comment: Interspeech 202
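The external-gazetteer baseline mentioned in the abstract can be sketched in a few lines. The gazetteer contents and feature encoding below are invented for illustration; the paper's actual deep encoding of external knowledge is more involved.

```python
# Hedged sketch of the gazetteer idea (toy data, illustrative encoding):
# tag each token with whether it appears in an external entity list,
# yielding features an NER model can consume alongside the text.

GAZETTEER = {
    "song": {"hello", "yesterday"},
    "artist": {"adele", "the beatles"},
}

def gazetteer_features(tokens):
    """One binary membership feature per gazetteer, per token."""
    feats = []
    for tok in tokens:
        t = tok.lower()
        feats.append({name: t in entries for name, entries in GAZETTEER.items()})
    return feats

print(gazetteer_features(["play", "Hello", "by", "Adele"]))
# the second and fourth tokens match the song and artist gazetteers
```

Such features help precisely in the noisy-input setting described above: even when the surrounding text is garbled by recognition errors, a token's presence in a frequently refreshed external list is a strong signal for its entity label.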